Is Taylor Swift a Harbinger of Doom? (R, SQL, Python)
Brandon Malady
Project Outline
Abstract: This project aims to explore the potential connection between Taylor Swift’s rise to stardom and disaster in the United States. As Swift continues to dominate the airwaves, one must ask: could her soaring success be a smokescreen for something far more sinister?
Data Sources: FBI, Crime in the United States; Centers for Disease Control and Prevention (National Center for Health Statistics); Google Trends data for “Taylor Swift”
Aim 1: Explore the relationship between Taylor Swift’s popularity and online presence with Violent Crime rates over time in the United States.
Using data collected by the FBI, we will explore the potential connection between Swift’s stardom and violent crime rates in the United States. Our FBI data set is broken down by year, which provides excellent temporal resolution. We will compare aggregate Taylor Swift interest over time against violent crime rates to see if there is a connection.
Aim 2: Investigate deaths from drug overdoses in the United States and their correlation with Taylor Swift’s popularity over time.
Using data collected by the CDC, we will explore the relationship between Taylor Swift’s popularity and drug overdose deaths in the United States. The data is broken down by year, which ensures we can examine temporal relationships between Taylor’s music and drug overdoses in the American population.
Data Sources & Screenshots
Taylor Swift Google Trends Data
Taylor Swift Popularity Data: Source, Google Trends. Numbers represent search interest relative to the highest point on the chart for the United States from 2004-2024. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means there was not enough data for this term.
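Google’s 0-100 scale can be reproduced from raw counts. A minimal Python sketch, using made-up search volumes (an assumption for illustration; Google Trends does not expose the underlying counts):

```python
# Hypothetical raw monthly search volumes -- illustration only.
raw = [150, 300, 600, 60]

# Normalize so the peak month scores 100, as Google Trends does.
peak = max(raw)
relative = [round(100 * x / peak) for x in raw]
print(relative)  # the 600 month becomes 100, the rest scale proportionally
```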
FBI Crime Data
Violent Crime Rate Data: Source, FBI. Violent crime rates represent total reported violent crimes (murder, rape, aggravated assault, robbery) per 100,000 members of the population; i.e., a crime rate of 500 means 500 violent crimes reported per 100,000 people.
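The rate definition can be checked directly. A quick Python sketch using the 2004 figures from the FBI table later in this document (1,360,088 violent crimes, population 293,656,842):

```python
# Violent crime rate = reported violent crimes per 100,000 population.
crimes = 1_360_088
population = 293_656_842
rate = crimes / population * 100_000
print(round(rate, 1))  # matches the FBI's published 2004 rate of 463.2
```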
CDC Drug Data
Drug Overdose Data: Source, Centers for Disease Control and Prevention, National Center for Health Statistics
Wrangled Data
Wrangled & Combined data of interest: 2004-2013 Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity. Sources: FBI, CDC, Google Trends “Taylor Swift”.
library(dplyr);
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyverse);
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.4
✔ ggplot2 3.5.1 ✔ stringr 1.5.0
✔ lubridate 1.9.2 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Taylor Swift Google Trends Data & Wrangling
#load in the Google Trends data on "Taylor Swift"
library(readr);
Taylor_Interest_Over_Time <- read_csv("Taylor_Interest_Over_Time.csv",
skip = 1)
Rows: 250 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Month, taylor swift: (United States)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(Taylor_Interest_Over_Time)
# A tibble: 250 × 2
Month `taylor swift: (United States)`
<chr> <chr>
1 2004-01 0
2 2004-02 0
3 2004-03 0
4 2004-04 0
5 2004-05 0
6 2004-06 0
7 2004-07 0
8 2004-08 0
9 2004-09 0
10 2004-10 0
# ℹ 240 more rows
# skip the data label row, as it is part of the raw CSV formatting, not the data
# Wrangle the data to be more usable
result <- Taylor_Interest_Over_Time %>% # Separate wider to get a "YYYY" column
# Split the "Month" column into year and month. Current format is "YYYY-MM"
separate("Month", into = c("year", "Month_new"), sep = "-", convert = TRUE) %>%
# Rename the popularity metric column
rename(Popularity = `taylor swift: (United States)`) %>%
# Convert "<1" to 0.5 in the Popularity column to avoid calculation errors. Approximation
mutate(Popularity = if_else(Popularity == "<1", 0.5, as.numeric(Popularity)))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `Popularity = if_else(Popularity == "<1", 0.5,
as.numeric(Popularity))`.
Caused by warning in `if_else()`:
! NAs introduced by coercion
# Aggregate Taylor Swift Data monthly data to yearly sums
Taylor_swift_popularity_yearly_sums <- result %>%
group_by(year) %>%
summarise(total_count = sum(Popularity)) %>%
arrange(year);
# Print the result
print(Taylor_swift_popularity_yearly_sums)
# A tibble: 21 × 2
year total_count
<int> <dbl>
1 2004 0
2 2005 0.5
3 2006 15
4 2007 52
5 2008 122
6 2009 265
7 2010 220
8 2011 194
9 2012 235
10 2013 220
# ℹ 11 more rows
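The same monthly-to-yearly aggregation, sketched in plain Python (the toy rows below stand in for the Google Trends CSV; the "<1" → 0.5 approximation mirrors the R code above):

```python
from collections import defaultdict

# Toy (month, popularity) rows in the Google Trends format.
rows = [("2005-11", "<1"), ("2009-06", "130"), ("2009-07", "135")]

yearly = defaultdict(float)
for month, popularity in rows:
    year = int(month.split("-")[0])          # "2009-06" -> 2009
    value = 0.5 if popularity == "<1" else float(popularity)
    yearly[year] += value

print(dict(yearly))  # {2005: 0.5, 2009: 265.0}
```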
# Generate a "synonym" by converting "2004"-style year values to "Two Thousand and Four"
[1] "Two Thousand and Four"
# Chat GPT was used with prompt "I want to convert the year column from displaying "2001" to "two thousand and one" for each year" to generate the following function
number_to_words <- function(num) {
ones <- c("", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
tens <- c("", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety")
teens <- c("ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen")
thousands <- num %/% 1000
remainder <- num %% 1000
result <- if (thousands > 0) paste(ones[thousands + 1], "thousand") else ""
if (remainder >= 100) {
hundreds <- remainder %/% 100
result <- paste(result, ones[hundreds + 1], "hundred")
remainder <- remainder %% 100
}
if (remainder > 0) {
if (nchar(result) > 0) result <- paste(result, "and")
if (remainder < 20) {
result <- paste(result, if (remainder < 10) ones[remainder + 1] else teens[remainder - 9])
} else {
result <- paste(result, tens[remainder %/% 10 + 1], ones[remainder %% 10 + 1])
}
}
return(trimws(result))
}
Taylor_swift_popularity_yearly_sums <- Taylor_swift_popularity_yearly_sums %>%
mutate(year = sapply(year, function(x) tools::toTitleCase(number_to_words(as.numeric(x)))))
print(Taylor_swift_popularity_yearly_sums)
# A tibble: 21 × 2
year total_count
<chr> <dbl>
1 Two Thousand and Four 0
2 Two Thousand and Five 0.5
3 Two Thousand and Six 15
4 Two Thousand and Seven 52
5 Two Thousand and Eight 122
6 Two Thousand and Nine 265
7 Two Thousand and Ten 220
8 Two Thousand and Eleven 194
9 Two Thousand and Twelve 235
10 Two Thousand and Thirteen 220
# ℹ 11 more rows
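A compact Python analogue of the year-to-words helper, restricted to 2000-2099 since that covers this dataset (an assumption for brevity; the R function above is more general):

```python
ONES = ["", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine"]
TEENS = ["Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen",
         "Sixteen", "Seventeen", "Eighteen", "Nineteen"]
TENS = ["", "", "Twenty", "Thirty", "Forty", "Fifty", "Sixty", "Seventy",
        "Eighty", "Ninety"]

def year_to_words(y):
    """Spell out a year in 2000-2099, e.g. 2004 -> 'Two Thousand and Four'."""
    rem = y % 100
    if rem == 0:
        return "Two Thousand"
    if rem < 10:
        tail = ONES[rem]
    elif rem < 20:
        tail = TEENS[rem - 10]
    else:
        tail = TENS[rem // 10] + (" " + ONES[rem % 10] if rem % 10 else "")
    return "Two Thousand and " + tail

print(year_to_words(2004))  # Two Thousand and Four
```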
The above table will serve as our pseudo-synonym data source: “Two Thousand and Four” is a synonym for “2004”, which we will wrangle back before a subsequent join into a master data table.
# Create a function to map the spelled-out year values back to numeric
year_to_numeric <- function(year_text) {
year_mapping <- c(
"Two Thousand and Four" = 2004, "Two Thousand and Five" = 2005,
"Two Thousand and Six" = 2006, "Two Thousand and Seven" = 2007,
"Two Thousand and Eight" = 2008, "Two Thousand and Nine" = 2009,
"Two Thousand and Ten" = 2010, "Two Thousand and Eleven" = 2011,
"Two Thousand and Twelve" = 2012, "Two Thousand and Thirteen" = 2013
)
return(year_mapping[year_text])
}
# Apply the language to numeric "YYYY" conversion
Taylor_swift_popularity_yearly_sums <- Taylor_swift_popularity_yearly_sums %>%
mutate(year = sapply(year, year_to_numeric))
# Display the result
print(Taylor_swift_popularity_yearly_sums)
# A tibble: 21 × 2
year total_count
<dbl> <dbl>
1 2004 0
2 2005 0.5
3 2006 15
4 2007 52
5 2008 122
6 2009 265
7 2010 220
8 2011 194
9 2012 235
10 2013 220
# ℹ 11 more rows
The spelled-out year format has now been handled, and our synonym column is ready for subsequent joins.
FBI Crime Data & Wrangling
#Load FBI Violent Crime Data
library(readxl)
FBI_violent_crime_per_year <- read_excel("FBI_violent_crime_per_year.xls",
skip = 3)
New names:
• `` -> `...23`
• `` -> `...24`
print(FBI_violent_crime_per_year)
# A tibble: 27 × 24
Year Population1 `Violent\ncrime2` `Violent \ncrime \nrate`
<chr> <dbl> <dbl> <dbl>
1 2000 281421906 1425486 506.
2 20015 285317559 1439480 504.
3 2002 287973924 1423677 494.
4 2003 290788976 1383676 476.
5 2004 293656842 1360088 463.
6 2005 296507061 1390745 469
7 2006 299398484 1435123 479.
8 2007 301621157 1422970 472.
9 2008 304059724 1394461 459.
10 2009 307006550 1325896 432.
# ℹ 17 more rows
# ℹ 20 more variables: `Murder and\nnonnegligent \nmanslaughter` <dbl>,
# `Murder and \nnonnegligent \nmanslaughter \nrate` <dbl>,
# `Rape\n(revised \ndefinition)3` <dbl>,
# `Rape\n(revised \ndefinition) \nrate3` <dbl>,
# `Rape\n(legacy \ndefinition)4` <dbl>,
# `Rape\n(legacy \ndefinition) \nrate4` <dbl>, Robbery <dbl>, …
#get only the data of interest (Year & Crime Rate)
new_data_frame <- FBI_violent_crime_per_year |>
select("Year", "Violent \ncrime \nrate") |>
rename(year = Year, crime_rate = `Violent \ncrime \nrate`)
# Drop the last 7 rows as these are footnotes in the raw data table (see png)
FBI_Crime_Rate <- head(new_data_frame, -7)
# Fix the typos in the year column (typos introduced by footnotes)
FBI_Crime_Rate$year <- ifelse(FBI_Crime_Rate$year == 20186, 2018, FBI_Crime_Rate$year)
FBI_Crime_Rate$year <- ifelse(FBI_Crime_Rate$year == 20015, 2001, FBI_Crime_Rate$year)
# Convert the year column to numeric for subsequent join
FBI_Crime_Rate$year <- as.numeric(FBI_Crime_Rate$year)
# View the result
print(FBI_Crime_Rate)
# A tibble: 20 × 2
year crime_rate
<dbl> <dbl>
1 2000 506.
2 2001 504.
3 2002 494.
4 2003 476.
5 2004 463.
6 2005 469
7 2006 479.
8 2007 472.
9 2008 459.
10 2009 432.
11 2010 404.
12 2011 387.
13 2012 388.
14 2013 369.
15 2014 362.
16 2015 374.
17 2016 387.
18 2017 384.
19 2018 370.
20 2019 367.
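The footnote-digit typos (“20015”, “20186”) arise because the FBI table appends footnote markers to year labels. Instead of fixing each one by hand as above, a general rule can be sketched in Python (an alternative approach, assuming every year is four digits):

```python
# FBI year labels with footnote digits appended (e.g. "20015" = 2001, footnote 5).
raw_years = ["2000", "20015", "2002", "20186"]

# Keep only the first four characters -- valid as long as years are four digits.
years = [int(y[:4]) for y in raw_years]
print(years)  # [2000, 2001, 2002, 2018]
```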
# Rename columns in the combined dataframe. Combining Taylor Swift and FBI data
combined_data <- FBI_Crime_Rate %>%
inner_join(Taylor_swift_popularity_yearly_sums, by = "year") %>%
rename(
`FBI Violent Crime` = crime_rate,
`Taylor Swift Popularity` = total_count
)
# View the result
print(combined_data)
# A tibble: 10 × 3
year `FBI Violent Crime` `Taylor Swift Popularity`
<dbl> <dbl> <dbl>
1 2004 463. 0
2 2005 469 0.5
3 2006 479. 15
4 2007 472. 52
5 2008 459. 122
6 2009 432. 265
7 2010 404. 220
8 2011 387. 194
9 2012 388. 235
10 2013 369. 220
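The inner join keeps only years present in both tables (hence the result starts at 2004, the first Google Trends year). A minimal dict-based sketch of the same operation in Python, with toy subsets of the two tables:

```python
# Toy subsets of the two tables, keyed by year.
crime = {2004: 463.2, 2005: 469.0, 2014: 362.0}
popularity = {2004: 0.0, 2005: 0.5, 2006: 15.0}

# Inner join: intersect the key sets, then pair up the values.
combined = {y: (crime[y], popularity[y])
            for y in sorted(crime.keys() & popularity.keys())}
print(combined)  # 2014 and 2006 drop out -- each appears in only one table
```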
CDC (National Center for Health Statistics) Drug Data & Wrangling
# Drug_Use_Data
# Load and preview the CDC Drug Data
library(readxl)
Overdose_table <- read_excel("Overdose table.xlsx",
skip = 2) # Skip the top label rows as these are not data
New names:
• `Number of deaths` -> `Number of deaths...2`
• `Deaths per 100,000` -> `Deaths per 100,000...3`
• `` -> `...4`
• `` -> `...5`
• `Number of deaths` -> `Number of deaths...6`
• `` -> `...7`
• `Deaths per 100,000` -> `Deaths per 100,000...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `Number of deaths` -> `Number of deaths...12`
• `` -> `...13`
• `Deaths per 100,000` -> `Deaths per 100,000...14`
library(dplyr)
library(stringr)
# Remove "....." from year entries from raw formatting
Overdose_table <- Overdose_table %>%
mutate(Year = as.integer(str_replace_all(Year, "[^\\d]", ""))) # convert to numeric for subsequent join
# Rename the "Year" column to "year"
Overdose_table <- Overdose_table %>%
rename(year = Year) #making cases the same for subsequent Join
# Display the first few rows
print(Overdose_table)
# A tibble: 21 × 14
year `Number of deaths...2` `Deaths per 100,000...3` ...4 ...5
<int> <dbl> <dbl> <lgl> <lgl>
1 2001 19394 6.8 NA NA
2 2002 23518 8.2 NA NA
3 2003 25785 8.9 NA NA
4 2004 27424 9.4 NA NA
5 2005 29813 10.1 NA NA
6 2006 34425 11.5 NA NA
7 2007 36010 11.9 NA NA
8 2008 36450 11.9 NA NA
9 2009 37004 11.9 NA NA
10 2010 38329 12.3 NA NA
# ℹ 11 more rows
# ℹ 9 more variables: `Number of deaths...6` <dbl>, ...7 <lgl>,
# `Deaths per 100,000...8` <dbl>, ...9 <lgl>, ...10 <lgl>, ...11 <lgl>,
# `Number of deaths...12` <dbl>, ...13 <lgl>, `Deaths per 100,000...14` <dbl>
#Extract only Total deaths and Year columns
Overdose_table_subset <- Overdose_table %>%
select(1:2) %>%
rename(year = 1, number_of_deaths = 2) # Rename columns for clarity
# Display the first few rows of the new data frame
print(head(Overdose_table_subset))
# A tibble: 6 × 2
year number_of_deaths
<int> <dbl>
1 2001 19394
2 2002 23518
3 2003 25785
4 2004 27424
5 2005 29813
6 2006 34425
Combined Data from all 3 sources, Master Table for Analysis
library(dplyr)
# Join with Previous combined data with CDC data to create master table
TaylorSwift_ViolentCrime_DrugOverdose <- combined_data %>%
left_join(Overdose_table_subset, by = "year") %>%
rename(`Drug Overdoses` = number_of_deaths)
# Display the first few rows of the joined dataframe
print(TaylorSwift_ViolentCrime_DrugOverdose)
# A tibble: 10 × 4
year `FBI Violent Crime` `Taylor Swift Popularity` `Drug Overdoses`
<dbl> <dbl> <dbl> <dbl>
1 2004 463. 0 27424
2 2005 469 0.5 29813
3 2006 479. 15 34425
4 2007 472. 52 36010
5 2008 459. 122 36450
6 2009 432. 265 37004
7 2010 404. 220 38329
8 2011 387. 194 41340
9 2012 388. 235 41502
10 2013 369. 220 43982
Combined data of interest (2004-2013): Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity. Sources: FBI, CDC, Google Trends “Taylor Swift”
Data Analysis & Visualization
#Scale all data to range from 0-100 for graphical visualization of relative trends per dataset
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(tidyr)
# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
mutate(across(-year, scale_0_100))
# Step 2: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data %>%
pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 3: Create the plot using ggplot2
ggplot(scaled_data_long, aes(x = year, y = value, color = variable)) +
geom_line(linewidth = 1) + # Line plot
geom_point(size = 3) + # Add points for each data point
labs(title = "Scaled Values of Violent Crime, Popularity, and Drug Overdoses Over Time",
x = "Year", y = "Scaled Values") +
theme_minimal() + # Clean theme
scale_color_manual(values = c("red", "pink", "black")) + # Custom colors
theme(legend.title = element_blank()) + # Remove legend title
theme(panel.grid.major = element_line(color = "gray", linewidth = 0.5)) # Add gridlines
Figure(1) Line plot showing scaled Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity, 2004-2013. Sources: FBI, CDC, Google Trends “Taylor Swift”
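The 0-100 min-max scaling behind these plots, sketched in Python (note that it divides by the range, so a constant column would need a guard against division by zero):

```python
def scale_0_100(xs):
    """Linearly rescale values so min -> 0 and max -> 100."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * 100 for x in xs]

# The 2013, 2004, and 2006 violent crime rates from the table above.
scaled = scale_0_100([369.1, 463.2, 479.3])
print([round(v, 1) for v in scaled])
```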
#Same comparison but visualized with grouped bar chart
# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
mutate(across(-year, scale_0_100))
# Step 2: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data %>%
pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 3: Create the grouped bar plot using ggplot2
ggplot(scaled_data_long, aes(x = factor(year), y = value, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.8) +
labs(title = "Scaled Values of Violent Crime, Popularity, and Drug Overdoses by Year",
x = "Year",
y = "Scaled Values") +
scale_fill_manual(values = c("FBI Violent Crime" = "red",
"Taylor Swift Popularity" = "pink",
"Drug Overdoses" = "black"),
name = "Metric") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom") +
scale_y_continuous(limits = c(0, 100)) +
geom_text(aes(label = round(value, 1)),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 3)
Figure(2) Grouped bar chart showing scaled Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity, 2004-2013. Sources: FBI, CDC, Google Trends “Taylor Swift”
# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
mutate(across(-year, scale_0_100))
# Step 2: Select only the columns for 'Taylor Swift Popularity' and 'Drug Overdoses'
scaled_data_subset <- scaled_data %>%
select(year, `Taylor Swift Popularity`, `Drug Overdoses`)
# Step 3: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data_subset %>%
pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 4: Create the plot using ggplot2
ggplot(scaled_data_long, aes(x = factor(year), y = value, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.8) +
labs(title = "Scaled Values of Taylor Swift Popularity and Drug Overdoses by Year",
x = "Year",
y = "Scaled Values") +
scale_fill_manual(values = c("Taylor Swift Popularity" = "pink",
"Drug Overdoses" = "black"),
name = "Metric") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom") +
scale_y_continuous(limits = c(0, 100)) +
geom_text(aes(label = round(value, 1)),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 3)
Figure(3) Grouped bar chart showing scaled Taylor Swift Popularity and Drug Overdoses, 2004-2013. Sources: CDC, Google Trends “Taylor Swift”
Here we see a clear year-by-year correlation between Taylor Swift’s rise in popularity and drug overdoses in the United States of America. It’s horrific that Ms. Swift’s music is causing these deaths, but such is life.
#Show correlation between drug overdose and popularity with a scatterplot/trendline
# Create the plot
ggplot(scaled_data, aes(x = year)) +
# Scatter plot for Taylor Swift Popularity
geom_point(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 3) +
# Scatter plot for Drug Overdoses
geom_point(aes(y = `Drug Overdoses`, color = "Drug Overdoses"), size = 3, shape = 15) +
# Fitted line for Taylor Swift Popularity
geom_smooth(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"),
method = "lm", se = FALSE, linetype = "dashed") +
# Fitted line for Drug Overdoses
geom_smooth(aes(y = `Drug Overdoses`, color = "Drug Overdoses"),
method = "lm", se = FALSE, linetype = "dashed") +
# Customize colors
scale_color_manual(values = c("Taylor Swift Popularity" = "pink", "Drug Overdoses" = "black")) +
# Labels and title
labs(title = "Scaled Values of Taylor Swift Popularity and Drug Overdoses Over Time",
x = "Year",
y = "Scaled Value (0 to 100)",
color = "") +
# Customize the theme
theme_minimal() +
theme(legend.position = "top",
panel.grid.major = element_line(color = "gray", linetype = "dotted"),
axis.text.x = element_text(angle = 0, hjust = 0.5)) +
# Set y-axis limits
scale_y_continuous(limits = c(0, 100)) +
# Set x-axis to show all years
scale_x_continuous(breaks = scaled_data$year)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_smooth()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_smooth()`).
Figure(4) Scatterplot with trendlines showing scaled Taylor Swift Popularity and Drug Overdoses, 2004-2013. Sources: CDC, Google Trends “Taylor Swift”
We again see a clear year-by-year correlation between Taylor Swift’s rise in popularity and drug overdoses in the United States of America.
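The figures only eyeball this relationship; a quick sketch that actually computes Pearson’s r for popularity vs. overdose deaths, using the 2004-2013 values from the master table above (pure Python to avoid dependencies):

```python
from math import sqrt

pop = [0, 0.5, 15, 52, 122, 265, 220, 194, 235, 220]  # Taylor Swift Popularity
deaths = [27424, 29813, 34425, 36010, 36450, 37004,
          38329, 41340, 41502, 43982]                  # Drug Overdoses

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson(pop, deaths)
print(round(r, 2))  # strongly positive -- though correlation is not causation
```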
# Create the plot
ggplot(scaled_data, aes(x = year)) +
# Line and points for Taylor Swift Popularity
geom_line(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), linewidth = 1) +
geom_point(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 3) +
# Line and points for Violent Crime Rate
geom_line(aes(y = `FBI Violent Crime`, color = "FBI Violent Crime"), linewidth = 1) +
geom_point(aes(y = `FBI Violent Crime`, color = "FBI Violent Crime"), size = 3, shape = 15) +
# Customize colors
scale_color_manual(values = c("Taylor Swift Popularity" = "pink", "FBI Violent Crime" = "red")) +
# Labels and title
labs(title = "Scaled Values of Taylor Swift Popularity and FBI Violent Crime Over Time",
x = "Year",
y = "Scaled Value (0 to 100)",
color = "") +
# Customize the theme
theme_minimal() +
theme(legend.position = "top",
panel.grid = element_blank(), # Remove all grid lines
axis.text.x = element_text(angle = 0, hjust = 0.5)) +
# Set y-axis limits
scale_y_continuous(limits = c(0, 100)) +
# Set x-axis to show all years
scale_x_continuous(breaks = scaled_data$year)
Figure(5) Line plot showing scaled Taylor Swift Popularity and FBI Violent Crime, 2004-2013. Sources: FBI, Google Trends “Taylor Swift”
Shockingly, Ms. Swift’s music seems to have a calming effect on American citizens. Perhaps Taylor is sending Americans into a drug-ridden stupor.
Altogether, this work provides unequivocal evidence of Taylor Swift’s corruption of America. Ms. Swift’s music lulls the masses into a drug-ridden complacency.
Future work should investigate the negative correlation between Ms. Swift’s music and the violent crime rate in America. One plausible explanation is that Ms. Swift’s music triggers downregulation of testosterone production in males, leading to reduced violent crime.
library(DBI)
library(RSQLite)
library(readxl)
# Create a connection to an SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Load Taylor Swift Popularity Data
Taylor_Interest <- read.csv("Taylor_Interest_Over_Time.csv", skip = 1)
dbWriteTable(con, "Taylor_Interest", Taylor_Interest)
# Load FBI Violent Crime Data
FBI_Crime <- read_excel("FBI_violent_crime_per_year.xls", skip = 3)
New names:
• `` -> `...23`
• `` -> `...24`
dbWriteTable(con, "FBI_Crime", FBI_Crime)
# Load Drug Overdose Data
Overdose_Data <- read_excel("Overdose table.xlsx", skip = 2)
New names:
• `Number of deaths` -> `Number of deaths...2`
• `Deaths per 100,000` -> `Deaths per 100,000...3`
• `` -> `...4`
• `` -> `...5`
• `Number of deaths` -> `Number of deaths...6`
• `` -> `...7`
• `Deaths per 100,000` -> `Deaths per 100,000...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `Number of deaths` -> `Number of deaths...12`
• `` -> `...13`
• `Deaths per 100,000` -> `Deaths per 100,000...14`
dbWriteTable(con, "Overdose_Data", Overdose_Data)
-- Get the raw Taylor Swift Google Trends data
SELECT * FROM Taylor_Interest
| Month | taylor.swift…United.States. |
|---|---|
| 2004-01 | 0 |
| 2004-02 | 0 |
| 2004-03 | 0 |
| 2004-04 | 0 |
| 2004-05 | 0 |
| 2004-06 | 0 |
| 2004-07 | 0 |
| 2004-08 | 0 |
| 2004-09 | 0 |
| 2004-10 | 0 |
-- Separate wider to get year and month columns; handle '<1' values with a 0.5 approximation for calculations
CREATE TABLE Taylor_Interest_Transformed AS
SELECT
CAST(substr(Month, 1, 4) AS INTEGER) AS year,
CAST(substr(Month, 6, 2) AS INTEGER) AS Month_new,
CASE
WHEN `taylor.swift...United.States.` = '<1' THEN 0.5
ELSE CAST(`taylor.swift...United.States.` AS REAL)
END AS Popularity
FROM
Taylor_Interest;
-- Get the transformed Taylor Swift Google Trends data
SELECT * FROM Taylor_Interest_Transformed
| year | Month_new | Popularity |
|---|---|---|
| 2004 | 1 | 0 |
| 2004 | 2 | 0 |
| 2004 | 3 | 0 |
| 2004 | 4 | 0 |
| 2004 | 5 | 0 |
| 2004 | 6 | 0 |
| 2004 | 7 | 0 |
| 2004 | 8 | 0 |
| 2004 | 9 | 0 |
| 2004 | 10 | 0 |
-- Aggregate the totals after grouping by year
SELECT
year,
SUM(Popularity) AS Total_Interest
FROM
Taylor_Interest_Transformed
GROUP BY
year
ORDER BY
year;
| year | Total_Interest |
|---|---|
| 2004 | 0.0 |
| 2005 | 0.5 |
| 2006 | 15.0 |
| 2007 | 52.0 |
| 2008 | 122.0 |
| 2009 | 265.0 |
| 2010 | 220.0 |
| 2011 | 194.0 |
| 2012 | 235.0 |
| 2013 | 220.0 |
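The SQL pipeline above is portable beyond R’s DBI/RSQLite. A self-contained sketch with Python’s stdlib sqlite3, with toy rows standing in for the Trends CSV, showing the same transform-then-aggregate steps:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Taylor_Interest (Month TEXT, Popularity TEXT)")
con.executemany("INSERT INTO Taylor_Interest VALUES (?, ?)",
                [("2005-11", "<1"), ("2009-06", "130"), ("2009-07", "135")])

# Extract the year, approximate '<1' as 0.5, and sum per year -- mirroring
# the CREATE TABLE ... / GROUP BY queries above.
rows = con.execute("""
    SELECT CAST(substr(Month, 1, 4) AS INTEGER) AS year,
           SUM(CASE WHEN Popularity = '<1' THEN 0.5
                    ELSE CAST(Popularity AS REAL) END) AS total
    FROM Taylor_Interest
    GROUP BY year
    ORDER BY year
""").fetchall()
con.close()

print(rows)  # [(2005, 0.5), (2009, 265.0)]
```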
-- Create the table to store for later
CREATE TABLE Taylor_Swift_Interest_Clean AS
SELECT
year,
SUM(Popularity) AS total_count
FROM
Taylor_Interest_Transformed
GROUP BY
year
ORDER BY
year;
-- Get the FBI data
SELECT * FROM FBI_crime
Table: Displaying records 1 - 10
|Year | Population1| Violent crime2| Violent crime rate| … |
|:----|-----------:|--------------:|------------------:|---|
|2000 | 281421906| 1425486| 506.5| … |
|20015 | 285317559| 1439480| 504.5| … |
|2002 | 287973924| 1423677| 494.4| … |
|2003 | 290788976| 1383676| 475.8| … |
|2004 | 293656842| 1360088| 463.2| … |
|2005 | 296507061| 1390745| 469.0| … |
|2006 | 299398484| 1435123| 479.3| … |
|2007 | 301621157| 1422970| 471.8| … |
|2008 | 304059724| 1394461| 458.6| … |
|2009 | 307006550| 1325896| 431.9| … |

(Offense-specific columns — murder, rape, robbery, aggravated assault, property crime, and their rates — elided for readability; see the tibble preview above for the full 24-column structure.)
-- View Data structure. The column names have hidden characters so you'll need to use this to correctly wrangle data
PRAGMA table_info(FBI_crime);
| cid | name | type | notnull | dflt_value | pk |
|---|---|---|---|---|---|
| 0 | Year | TEXT | 0 | NA | 0 |
| 1 | Population1 | REAL | 0 | NA | 0 |
| 2 | Violent\ncrime2 | REAL | 0 | NA | 0 |
| 3 | Violent \ncrime \nrate | REAL | 0 | NA | 0 |
| 4 | Murder and\nnonnegligent \nmanslaughter | REAL | 0 | NA | 0 |
| 5 | Murder and \nnonnegligent \nmanslaughter \nrate | REAL | 0 | NA | 0 |
| 6 | Rape\n(revised \ndefinition)3 | REAL | 0 | NA | 0 |
| 7 | Rape\n(revised \ndefinition) \nrate3 | REAL | 0 | NA | 0 |
| 8 | Rape\n(legacy \ndefinition)4 | REAL | 0 | NA | 0 |
| 9 | Rape\n(legacy \ndefinition) \nrate4 | REAL | 0 | NA | 0 |
-- Extract the data of interest (Year & Crime Rate)
SELECT
Year AS year,
`Violent
crime
rate` AS crime_rate
FROM
FBI_crime;
| year | crime_rate |
|---|---|
| 2000 | 506.5 |
| 20015 | 504.5 |
| 2002 | 494.4 |
| 2003 | 475.8 |
| 2004 | 463.2 |
| 2005 | 469.0 |
| 2006 | 479.3 |
| 2007 | 471.8 |
| 2008 | 458.6 |
| 2009 | 431.9 |
-- Extract the data of interest (Year & Crime Rate) and number the rows for the ability to remove the last 7 rows which are just footnotes.
WITH numbered_rows AS (
SELECT
Year AS year,
`Violent
crime
rate` AS crime_rate,
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS row_num
FROM FBI_crime
WHERE CAST(Year AS INTEGER) IS NOT NULL
)
SELECT year, crime_rate
FROM numbered_rows
WHERE row_num <= (SELECT COUNT(*) FROM numbered_rows) - 7
| year | crime_rate |
|---|---|
| 2000 | 506.5 |
| 20015 | 504.5 |
| 2002 | 494.4 |
| 2003 | 475.8 |
| 2004 | 463.2 |
| 2005 | 469.0 |
| 2006 | 479.3 |
| 2007 | 471.8 |
| 2008 | 458.6 |
| 2009 | 431.9 |
-- Create the clean data from the FBI data and save it for later Join, fix typos from footnotes
CREATE TABLE FBI_Crime_Clean AS
WITH numbered_rows AS (
SELECT
CASE
WHEN Year = 20015 THEN 2001
WHEN Year = 20186 THEN 2018
ELSE Year
END AS year,
`Violent
crime
rate` AS crime_rate,
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS row_num
FROM FBI_crime
WHERE CAST(Year AS INTEGER) IS NOT NULL
)
SELECT year, crime_rate
FROM numbered_rows
WHERE row_num <= (SELECT COUNT(*) FROM numbered_rows) - 7
Wrangled FBI crime data, ready for use.
-- Get the CDC data
SELECT * FROM Overdose_Data
| Year | Number of deaths…2 | Deaths per 100,000…3 | …4 | …5 | Number of deaths…6 | …7 | Deaths per 100,000…8 | …9 | …10 | …11 | Number of deaths…12 | …13 | Deaths per 100,000…14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001. . . . . . . . | 19394 | 6.8 | NA | NA | 12658 | NA | 9.0 | NA | NA | NA | 6736 | NA | 4.6 |
| 2002. . . . . . . . | 23518 | 8.2 | NA | NA | 15028 | NA | 10.6 | NA | NA | NA | 8490 | NA | 5.8 |
| 2003. . . . . . . . | 25785 | 8.9 | NA | NA | 16399 | NA | 11.5 | NA | NA | NA | 9386 | NA | 6.4 |
| 2004. . . . . . . . | 27424 | 9.4 | NA | NA | 17120 | NA | 11.8 | NA | NA | NA | 10304 | NA | 6.9 |
| 2005. . . . . . . . | 29813 | 10.1 | NA | NA | 18724 | NA | 12.8 | NA | NA | NA | 11089 | NA | 7.3 |
| 2006. . . . . . . . | 34425 | 11.5 | NA | NA | 21893 | NA | 14.8 | NA | NA | NA | 12532 | NA | 8.2 |
| 2007. . . . . . . . | 36010 | 11.9 | NA | NA | 22298 | NA | 14.9 | NA | NA | NA | 13712 | NA | 8.8 |
| 2008. . . . . . . . | 36450 | 11.9 | NA | NA | 22468 | NA | 14.9 | NA | NA | NA | 13982 | NA | 8.9 |
| 2009. . . . . . . . | 37004 | 11.9 | NA | NA | 22593 | NA | 14.8 | NA | NA | NA | 14411 | NA | 9.1 |
| 2010. . . . . . . . | 38329 | 12.3 | NA | NA | 23006 | NA | 15.0 | NA | NA | NA | 15323 | NA | 9.6 |
-- Remove the "....." formatting of the raw data
UPDATE Overdose_Data
SET Year = CAST(REPLACE(REPLACE(REPLACE(Year, '.', ''), ' ', ''), '-', '') AS INTEGER);
-- Get the cleaned data
SELECT * FROM Overdose_Data
| Year | Number of deaths…2 | Deaths per 100,000…3 | …4 | …5 | Number of deaths…6 | …7 | Deaths per 100,000…8 | …9 | …10 | …11 | Number of deaths…12 | …13 | Deaths per 100,000…14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001 | 19394 | 6.8 | NA | NA | 12658 | NA | 9.0 | NA | NA | NA | 6736 | NA | 4.6 |
| 2002 | 23518 | 8.2 | NA | NA | 15028 | NA | 10.6 | NA | NA | NA | 8490 | NA | 5.8 |
| 2003 | 25785 | 8.9 | NA | NA | 16399 | NA | 11.5 | NA | NA | NA | 9386 | NA | 6.4 |
| 2004 | 27424 | 9.4 | NA | NA | 17120 | NA | 11.8 | NA | NA | NA | 10304 | NA | 6.9 |
| 2005 | 29813 | 10.1 | NA | NA | 18724 | NA | 12.8 | NA | NA | NA | 11089 | NA | 7.3 |
| 2006 | 34425 | 11.5 | NA | NA | 21893 | NA | 14.8 | NA | NA | NA | 12532 | NA | 8.2 |
| 2007 | 36010 | 11.9 | NA | NA | 22298 | NA | 14.9 | NA | NA | NA | 13712 | NA | 8.8 |
| 2008 | 36450 | 11.9 | NA | NA | 22468 | NA | 14.9 | NA | NA | NA | 13982 | NA | 8.9 |
| 2009 | 37004 | 11.9 | NA | NA | 22593 | NA | 14.8 | NA | NA | NA | 14411 | NA | 9.1 |
| 2010 | 38329 | 12.3 | NA | NA | 23006 | NA | 15.0 | NA | NA | NA | 15323 | NA | 9.6 |
-- Rename the Year column to lowercase year for the subsequent join
ALTER TABLE Overdose_Data
RENAME COLUMN Year TO year;

-- View the data
SELECT *
FROM Overdose_Data
LIMIT 5;

| year | Number of deaths…2 | Deaths per 100,000…3 | …4 | …5 | Number of deaths…6 | …7 | Deaths per 100,000…8 | …9 | …10 | …11 | Number of deaths…12 | …13 | Deaths per 100,000…14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001 | 19394 | 6.8 | NA | NA | 12658 | NA | 9.0 | NA | NA | NA | 6736 | NA | 4.6 |
| 2002 | 23518 | 8.2 | NA | NA | 15028 | NA | 10.6 | NA | NA | NA | 8490 | NA | 5.8 |
| 2003 | 25785 | 8.9 | NA | NA | 16399 | NA | 11.5 | NA | NA | NA | 9386 | NA | 6.4 |
| 2004 | 27424 | 9.4 | NA | NA | 17120 | NA | 11.8 | NA | NA | NA | 10304 | NA | 6.9 |
| 2005 | 29813 | 10.1 | NA | NA | 18724 | NA | 12.8 | NA | NA | NA | 11089 | NA | 7.3 |
-- Keep only the columns of interest and create a table
CREATE TABLE Overdose_Data_Clean AS
SELECT
year,
`Number of deaths...2` AS number_of_deaths
FROM Overdose_Data;

-- View the wrangled CDC data
SELECT * FROM Overdose_Data_Clean;

| year | number_of_deaths |
|---|---|
| 2001 | 19394 |
| 2002 | 23518 |
| 2003 | 25785 |
| 2004 | 27424 |
| 2005 | 29813 |
| 2006 | 34425 |
| 2007 | 36010 |
| 2008 | 36450 |
| 2009 | 37004 |
| 2010 | 38329 |
-- Join the CDC overdose data and FBI violent crime data
SELECT
o.year,
o.number_of_deaths AS overdose_deaths,
f.crime_rate
FROM
Overdose_Data_Clean o
INNER JOIN
FBI_Crime_Clean f
ON
o.year = f.year;

| year | overdose_deaths | crime_rate |
|---|---|---|
| 2002 | 23518 | 494.4 |
| 2003 | 25785 | 475.8 |
| 2004 | 27424 | 463.2 |
| 2005 | 29813 | 469.0 |
| 2006 | 34425 | 479.3 |
| 2007 | 36010 | 471.8 |
| 2008 | 36450 | 458.6 |
| 2009 | 37004 | 431.9 |
| 2010 | 38329 | 404.5 |
| 2011 | 41340 | 387.1 |
-- Join all 3 data sets and create the master table for analysis
CREATE TABLE FBI_Drugs_TaylorSwift AS
SELECT
o.year,
o.number_of_deaths AS overdose_deaths,
f.crime_rate,
t.total_count AS taylor_swift_interest
FROM
Overdose_Data_Clean o
INNER JOIN
FBI_Crime_Clean f ON o.year = f.year
INNER JOIN
Taylor_Swift_Interest_Clean t ON o.year = t.year;

-- Show the master table
SELECT * FROM FBI_Drugs_TaylorSwift;

| year | overdose_deaths | crime_rate | taylor_swift_interest |
|---|---|---|---|
| 2004 | 27424 | 463.2 | 0.0 |
| 2005 | 29813 | 469.0 | 0.5 |
| 2006 | 34425 | 479.3 | 15.0 |
| 2007 | 36010 | 471.8 | 52.0 |
| 2008 | 36450 | 458.6 | 122.0 |
| 2009 | 37004 | 431.9 | 265.0 |
| 2010 | 38329 | 404.5 | 220.0 |
| 2011 | 41340 | 387.1 | 194.0 |
| 2012 | 41502 | 387.8 | 235.0 |
| 2013 | 43982 | 369.1 | 220.0 |
Combined data of interest: drug overdoses, violent crime, and Taylor Swift’s popularity, 2004-2013. Sources: FBI, CDC, Google Trends “Taylor Swift”
import pandas as pd
Taylor_Interest_Over_Time = pd.read_csv("Taylor_Interest_Over_Time.csv", skiprows=1)
print(Taylor_Interest_Over_Time)
Month taylor swift: (United States)
0 2004-01 0
1 2004-02 0
2 2004-03 0
3 2004-04 0
4 2004-05 0
.. ... ...
245 2024-06 36
246 2024-07 31
247 2024-08 31
248 2024-09 40
249 2024-10 27
[250 rows x 2 columns]
import pandas as pd
# Transform the data to be more usable
result = Taylor_Interest_Over_Time.copy()
# Split the "Month" column into year and month
result[['year', 'Month_new']] = result['Month'].str.split('-', expand=True)
# Rename the popularity metric column
result = result.rename(columns={'taylor swift: (United States)': 'Popularity'})
# Convert "<1" to 0.5 in the Popularity column
result['Popularity'] = result['Popularity'].replace('<1', '0.5')
result['Popularity'] = pd.to_numeric(result['Popularity'])
# Convert year to numeric type
result['year'] = pd.to_numeric(result['year'])
print(result)
Month Popularity year Month_new
0 2004-01 0.0 2004 01
1 2004-02 0.0 2004 02
2 2004-03 0.0 2004 03
3 2004-04 0.0 2004 04
4 2004-05 0.0 2004 05
.. ... ... ... ...
245 2024-06 36.0 2024 06
246 2024-07 31.0 2024 07
247 2024-08 31.0 2024 08
248 2024-09 40.0 2024 09
249 2024-10 27.0 2024 10
[250 rows x 4 columns]
import pandas as pd
# Aggregate Taylor Swift Data monthly data to yearly sums
Taylor_swift_popularity_yearly_sums = result.groupby('year')['Popularity'].sum().reset_index()
# Arrange by year
Taylor_swift_popularity_yearly_sums = Taylor_swift_popularity_yearly_sums.sort_values(by='year')
# Print the result
print(Taylor_swift_popularity_yearly_sums)
year Popularity
0 2004 0.0
1 2005 0.5
2 2006 15.0
3 2007 52.0
4 2008 122.0
5 2009 265.0
6 2010 220.0
7 2011 194.0
8 2012 235.0
9 2013 220.0
10 2014 203.0
11 2015 275.0
12 2016 181.0
13 2017 150.0
14 2018 106.0
15 2019 126.0
16 2020 102.0
17 2021 115.0
18 2022 173.0
19 2023 564.0
20 2024 465.0
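The same monthly-to-yearly aggregation can also be done by letting pandas parse the month strings as dates instead of splitting them by hand (a sketch on a hypothetical three-row frame, not the full Trends export):

```python
import pandas as pd

# Hypothetical miniature of the Google Trends export
trends = pd.DataFrame({
    "Month": ["2009-01", "2009-02", "2010-01"],
    "Popularity": ["<1", "30", "25"],
})

# "<1" is Google Trends' below-threshold marker; treat it as 0.5 as above
trends["Popularity"] = pd.to_numeric(trends["Popularity"].replace("<1", "0.5"))

# Parse the year directly with the datetime accessor
trends["year"] = pd.to_datetime(trends["Month"], format="%Y-%m").dt.year
yearly = trends.groupby("year")["Popularity"].sum().reset_index()
print(yearly)  # 2009 -> 30.5, 2010 -> 25.0
```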
# Generate a "synonym" by converting "2004" type values for year to
"Two Thousand and Four"'Two Thousand and Four'
# Chat GPT was used with prompt "I want to convert the year column from displaying "2001" to "two thousand and one" for each year" to generate the following function
import pandas as pd
def number_to_words(num):
    ones = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
    teens = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    if num == 2000:
        return "two thousand"
    thousands = num // 1000
    remainder = num % 1000
    result = ""
    if thousands > 0:
        result += ones[thousands] + " thousand"
    if remainder >= 100:
        hundreds = remainder // 100
        result += " " + ones[hundreds] + " hundred"
        remainder %= 100
    if remainder > 0:
        if len(result) > 0:
            result += " and"
        if remainder < 20:
            if remainder < 10:
                result += " " + ones[remainder]
            else:
                result += " " + teens[remainder - 10]
        else:
            result += " " + tens[remainder // 10]
            if remainder % 10 != 0:
                result += " " + ones[remainder % 10]
    return result.strip()
Taylor_swift_popularity_yearly_sums['year'] = Taylor_swift_popularity_yearly_sums['year'].apply(number_to_words)
# Capitalize the first letter of each word
Taylor_swift_popularity_yearly_sums['year'] = Taylor_swift_popularity_yearly_sums['year'].apply(lambda x: ' '.join(word.capitalize() for word in x.split()))
print(Taylor_swift_popularity_yearly_sums)
year Popularity
0 Two Thousand And Four 0.0
1 Two Thousand And Five 0.5
2 Two Thousand And Six 15.0
3 Two Thousand And Seven 52.0
4 Two Thousand And Eight 122.0
5 Two Thousand And Nine 265.0
6 Two Thousand And Ten 220.0
7 Two Thousand And Eleven 194.0
8 Two Thousand And Twelve 235.0
9 Two Thousand And Thirteen 220.0
10 Two Thousand And Fourteen 203.0
11 Two Thousand And Fifteen 275.0
12 Two Thousand And Sixteen 181.0
13 Two Thousand And Seventeen 150.0
14 Two Thousand And Eighteen 106.0
15 Two Thousand And Nineteen 126.0
16 Two Thousand And Twenty 102.0
17 Two Thousand And Twenty One 115.0
18 Two Thousand And Twenty Two 173.0
19 Two Thousand And Twenty Three 564.0
20 Two Thousand And Twenty Four 465.0
# Convert the synonym back to numeric form to align with the year column in our other data sources
import pandas as pd
# Define the mapping dictionary with the correct format
year_mapping = {
"Two Thousand And Four": 2004,
"Two Thousand And Five": 2005,
"Two Thousand And Six": 2006,
"Two Thousand And Seven": 2007,
"Two Thousand And Eight": 2008,
"Two Thousand And Nine": 2009,
"Two Thousand And Ten": 2010,
"Two Thousand And Eleven": 2011,
"Two Thousand And Twelve": 2012,
"Two Thousand And Thirteen": 2013,
"Two Thousand And Fourteen": 2014,
"Two Thousand And Fifteen": 2015,
"Two Thousand And Sixteen": 2016,
"Two Thousand And Seventeen": 2017,
"Two Thousand And Eighteen": 2018,
"Two Thousand And Nineteen": 2019,
"Two Thousand And Twenty": 2020,
"Two Thousand And Twenty One": 2021,
"Two Thousand And Twenty Two": 2022,
"Two Thousand And Twenty Three": 2023,
"Two Thousand And Twenty Four": 2024
}
# Replace the 'year' column with numeric values
Taylor_swift_popularity_yearly_sums["year"] = Taylor_swift_popularity_yearly_sums["year"].map(year_mapping)
print(Taylor_swift_popularity_yearly_sums)
year Popularity
0 2004 0.0
1 2005 0.5
2 2006 15.0
3 2007 52.0
4 2008 122.0
5 2009 265.0
6 2010 220.0
7 2011 194.0
8 2012 235.0
9 2013 220.0
10 2014 203.0
11 2015 275.0
12 2016 181.0
13 2017 150.0
14 2018 106.0
15 2019 126.0
16 2020 102.0
17 2021 115.0
18 2022 173.0
19 2023 564.0
20 2024 465.0
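As an alternative to typing the word-to-number table by hand, the reverse mapping can be derived by inverting a forward mapping with a dict comprehension, so the two directions can never drift out of sync (a sketch with two sample entries):

```python
# Forward mapping: numeric year -> word form (sample entries only)
words_for_year = {2004: "Two Thousand And Four", 2005: "Two Thousand And Five"}

# Invert it to get the words -> number table used for the .map() call above
year_mapping = {words: year for year, words in words_for_year.items()}
print(year_mapping["Two Thousand And Five"])  # 2005
```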
import pandas as pd
# Load FBI Violent Crime Data
FBI_violent_crime_per_year = pd.read_excel("FBI_violent_crime_per_year.xls", skiprows=3)
# Print the result
print(FBI_violent_crime_per_year)
Year ... Unnamed: 23
0 2000 ... NaN
1 20015 ... NaN
2 2002 ... NaN
3 2003 ... NaN
4 2004 ... NaN
5 2005 ... NaN
6 2006 ...
7 2007 ... NaN
8 2008 ... NaN
9 2009 ... NaN
10 2010 ... NaN
11 2011 ... NaN
12 2012 ... NaN
13 2013 ... NaN
14 2014 ... NaN
15 2015 ... NaN
16 2016 ... NaN
17 2017 ... NaN
18 20186 ... NaN
19 2019 ... NaN
20 1 Populations are U.S. Census Bureau provision... ... NaN
21 2 The violent crime figures include the offens... ... NaN
22 3 The figures shown in this column for the off... ... NaN
23 4 The figures shown in this column for the off... ... NaN
24 5 The murder and nonnegligent homicides that o... ... NaN
25 6 The crime figures have been adjusted. ... NaN
26 NOTE: Although arson data are included in the... ... NaN
[27 rows x 24 columns]
# Get the column names
columns = FBI_violent_crime_per_year.columns
# Find the 'Year' column (case-insensitive)
year_column = [col for col in columns if 'year' in col.lower()][0]
# Find the 'Violent crime rate' column (case-insensitive and ignoring newlines)
crime_rate_column = [col for col in columns if 'violent' in col.lower() and 'crime' in col.lower() and 'rate' in col.lower()][0]
# Select crime rate and year columns and rename them
new_data_frame = FBI_violent_crime_per_year[[year_column, crime_rate_column]].copy()
new_data_frame = new_data_frame.rename(columns={year_column: "year", crime_rate_column: "crime_rate"})
# Print the result
print(new_data_frame)
year crime_rate
0 2000 506.5
1 20015 504.5
2 2002 494.4
3 2003 475.8
4 2004 463.2
5 2005 469.0
6 2006 479.3
7 2007 471.8
8 2008 458.6
9 2009 431.9
10 2010 404.5
11 2011 387.1
12 2012 387.8
13 2013 369.1
14 2014 361.6
15 2015 373.7
16 2016 386.6
17 2017 383.8
18 20186 370.4
19 2019 366.7
20 1 Populations are U.S. Census Bureau provision... NaN
21 2 The violent crime figures include the offens... NaN
22 3 The figures shown in this column for the off... NaN
23 4 The figures shown in this column for the off... NaN
24 5 The murder and nonnegligent homicides that o... NaN
25 6 The crime figures have been adjusted. NaN
26 NOTE: Although arson data are included in the... NaN
# Drop the last 7 rows as these are footnotes
FBI_Crime_Rate = new_data_frame.iloc[:-7].copy()
# Fix the typos in the year column introduced by the footnotes
# (cover both numeric and string forms, since Excel may read the years either way)
FBI_Crime_Rate['year'] = FBI_Crime_Rate['year'].replace(
    {20186: 2018, 20015: 2001, '20186': 2018, '20015': 2001})
# Convert year to integer type
FBI_Crime_Rate['year'] = FBI_Crime_Rate['year'].astype(int)
# Print the result
print(FBI_Crime_Rate)
year crime_rate
0 2000 506.5
1 20015 504.5
2 2002 494.4
3 2003 475.8
4 2004 463.2
5 2005 469.0
6 2006 479.3
7 2007 471.8
8 2008 458.6
9 2009 431.9
10 2010 404.5
11 2011 387.1
12 2012 387.8
13 2013 369.1
14 2014 361.6
15 2015 373.7
16 2016 386.6
17 2017 383.8
18 20186 370.4
19 2019 366.7
# Convert the year column to numeric
FBI_Crime_Rate['year'] = pd.to_numeric(FBI_Crime_Rate['year'], errors='coerce')
# View the result and check the structure
print(FBI_Crime_Rate)
year crime_rate
0 2000 506.5
1 20015 504.5
2 2002 494.4
3 2003 475.8
4 2004 463.2
5 2005 469.0
6 2006 479.3
7 2007 471.8
8 2008 458.6
9 2009 431.9
10 2010 404.5
11 2011 387.1
12 2012 387.8
13 2013 369.1
14 2014 361.6
15 2015 373.7
16 2016 386.6
17 2017 383.8
18 20186 370.4
19 2019 366.7
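A more defensive version of the typo fix above (a sketch, not the code used in this pipeline) keeps only the first four characters of each year, which handles any footnote digit fused onto a year without hard-coding the bad values:

```python
import pandas as pd

# Hypothetical column containing footnote-fused years like 20015 and 20186
years = pd.Series(["2000", "20015", "2002", "20186"])

# Keep only the first four characters, then convert to integers
clean_years = years.astype(str).str.slice(0, 4).astype(int)
print(clean_years.tolist())  # [2000, 2001, 2002, 2018]
```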
# Perform an inner join and rename the crime rate column
combined_data = FBI_Crime_Rate.merge(Taylor_swift_popularity_yearly_sums, on='year', how='inner')
combined_data = combined_data.rename(columns={'crime_rate': 'FBI Violent Crime'})
# View the result
print(combined_data)
year FBI Violent Crime Popularity
0 2004 463.2 0.0
1 2005 469.0 0.5
2 2006 479.3 15.0
3 2007 471.8 52.0
4 2008 458.6 122.0
5 2009 431.9 265.0
6 2010 404.5 220.0
7 2011 387.1 194.0
8 2012 387.8 235.0
9 2013 369.1 220.0
10 2014 361.6 203.0
11 2015 373.7 275.0
12 2016 386.6 181.0
13 2017 383.8 150.0
14 2019 366.7 126.0
import pandas as pd
Overdose_table = pd.read_csv('overdose_table.csv')
print(Overdose_table)
year number_of_deaths
0 2001. . . . . . . . 19394
1 2002. . . . . . . . 23518
2 2003. . . . . . . . 25785
3 2004. . . . . . . . 27424
4 2005. . . . . . . . 29813
5 2006. . . . . . . . 34425
6 2007. . . . . . . . 36010
7 2008. . . . . . . . 36450
8 2009. . . . . . . . 37004
9 2010. . . . . . . . 38329
10 2011 . . . . . . . . 41340
11 2012. . . . . . . . 41502
12 2013. . . . . . . . 43982
13 2014. . . . . . . . 47055
14 2015. . . . . . . . 52404
15 2016. . . . . . . . 63632
16 2017. . . . . . . . 70237
17 2018. . . . . . . . 67367
18 2019. . . . . . . . 70630
19 2020. . . . . . . . 91799
20 2021. . . . . . . . 106699
import pandas as pd
import re
# Remove "....." from year entries and convert to integer
Overdose_table['year'] = Overdose_table['year'].apply(lambda x: re.sub(r'[^\d]', '', str(x)))
Overdose_table['year'] = pd.to_numeric(Overdose_table['year'], errors='coerce').astype('Int64')
# Print the result
print(Overdose_table)
year number_of_deaths
0 2001 19394
1 2002 23518
2 2003 25785
3 2004 27424
4 2005 29813
5 2006 34425
6 2007 36010
7 2008 36450
8 2009 37004
9 2010 38329
10 2011 41340
11 2012 41502
12 2013 43982
13 2014 47055
14 2015 52404
15 2016 63632
16 2017 70237
17 2018 67367
18 2019 70630
19 2020 91799
20 2021 106699
# Select only Total deaths and Year columns
Overdose_table_subset = Overdose_table.iloc[:, :2].copy()
# Rename columns for clarity
Overdose_table_subset = Overdose_table_subset.rename(columns={Overdose_table_subset.columns[0]: 'year', Overdose_table_subset.columns[1]: 'number_of_deaths'})
# Display the first few rows of the new DataFrame
print(Overdose_table_subset.head())
year number_of_deaths
0 2001 19394
1 2002 23518
2 2003 25785
3 2004 27424
4 2005 29813
# Check the structure of the new DataFrame
print(Overdose_table_subset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 21 non-null Int64
1 number_of_deaths 21 non-null int64
dtypes: Int64(1), int64(1)
memory usage: 485.0 bytes
None
import pandas as pd
# Join with the previous combined data and rename the overdose column
TaylorSwift_ViolentCrime_DrugOverdose = combined_data.merge(Overdose_table_subset, on='year', how='left')
TaylorSwift_ViolentCrime_DrugOverdose = TaylorSwift_ViolentCrime_DrugOverdose.rename(columns={'number_of_deaths': 'Drug Overdoses'})
# Display the joined dataframe
print(TaylorSwift_ViolentCrime_DrugOverdose)
year FBI Violent Crime Popularity Drug Overdoses
0 2004 463.2 0.0 27424
1 2005 469.0 0.5 29813
2 2006 479.3 15.0 34425
3 2007 471.8 52.0 36010
4 2008 458.6 122.0 36450
5 2009 431.9 265.0 37004
6 2010 404.5 220.0 38329
7 2011 387.1 194.0 41340
8 2012 387.8 235.0 41502
9 2013 369.1 220.0 43982
10 2014 361.6 203.0 47055
11 2015 373.7 275.0 52404
12 2016 386.6 181.0 63632
13 2017 383.8 150.0 70237
14 2019 366.7 126.0 70630
Combined data of interest: drug overdoses, violent crime, and Taylor Swift’s popularity, 2004-2019. Sources: FBI, CDC, Google Trends “Taylor Swift”
import pandas as pd
import matplotlib.pyplot as plt
data = TaylorSwift_ViolentCrime_DrugOverdose
df = pd.DataFrame(data)
# Function to scale values to range (0–100)
def scale_0_100(x):
    return ((x - x.min()) / (x.max() - x.min())) * 100
# Apply scaling to all columns except 'year'
scaled_data = df.copy()
for col in scaled_data.columns[1:]:
    scaled_data[col] = scale_0_100(scaled_data[col])
# Reshape the scaled data for plotting
scaled_data_long = scaled_data.melt(id_vars=['year'], var_name='variable', value_name='value')
# Create a time series line plot of the scaled values
plt.figure(figsize=(12, 6))
# Plot each metric's scaled value
for variable in scaled_data_long['variable'].unique():
    subset = scaled_data_long[scaled_data_long['variable'] == variable]
    plt.plot(subset['year'], subset['value'], marker='o', label=variable)
# Add labels and title
plt.title('Scaled Values of FBI Violent Crime,\nPopularity and Drug Overdoses (2004-2019)', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Scaled Values (0-100)', fontsize=12)
plt.xticks(scaled_data_long['year'].unique(), rotation=45)
plt.axhline(0, color='gray', linestyle='--', linewidth=1) # Add a baseline at y=0
# Add legend and grid
plt.legend(title='Metrics')
plt.grid(True)
# Show plot
plt.tight_layout()
plt.show()

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Define the data
data = TaylorSwift_ViolentCrime_DrugOverdose
# Create the DataFrame
df = pd.DataFrame(data)
# Function to scale values to range (0–100)
def scale_0_100(x):
    return ((x - x.min()) / (x.max() - x.min())) * 100
# Apply scaling to all columns except 'year'
scaled_data = df.copy()
for col in scaled_data.columns[1:]:
scaled_data[col] = scale_0_100(scaled_data[col])
# Reshape data for plotting
scaled_data_long = scaled_data.melt(id_vars=['year'], var_name='variable', value_name='value')
# Create a bar chart
plt.figure(figsize=(14, 7))
# Set bar width and positions
bar_width = 0.25
x_positions = np.arange(len(scaled_data['year']))
# Plot each variable as a separate set of bars
for i, variable in enumerate(scaled_data.columns[1:]):
    plt.bar(x_positions + i * bar_width,
            scaled_data[variable],
            width=bar_width,
            label=variable)
# Add labels and title
plt.title('Scaled Values of FBI Violent Crime,\nPopularity and Drug Overdoses (2004-2019)', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Scaled Values (0-100)', fontsize=12)
plt.xticks(x_positions + bar_width / len(scaled_data.columns[1:]), scaled_data['year'], rotation=45)
# Add legend and grid
plt.legend(title='Metrics')
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Show plot
plt.tight_layout()
plt.show()

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Define the data
data = TaylorSwift_ViolentCrime_DrugOverdose
# Create the DataFrame
df = pd.DataFrame(data)
# Function to scale values to range (0–100)
def scale_0_100(x):
    return ((x - x.min()) / (x.max() - x.min())) * 100
# Apply scaling to the columns
scaled_data = df.copy()
scaled_data["Popularity"] = scale_0_100(scaled_data["Popularity"])
scaled_data["Drug Overdoses"] = scale_0_100(scaled_data["Drug Overdoses"])
# Create a Scatter Plot
plt.figure(figsize=(12, 6))
plt.scatter(scaled_data["year"], scaled_data["Popularity"], label="Popularity", color='blue', marker='o')
plt.scatter(scaled_data["year"], scaled_data["Drug Overdoses"], label="Drug Overdoses", color='green', marker='x')
# Add trendlines
for variable in ["Popularity", "Drug Overdoses"]:
    z = np.polyfit(scaled_data["year"], scaled_data[variable], deg=1)  # Linear fit
    p = np.poly1d(z)
    plt.plot(scaled_data["year"], p(scaled_data["year"]), label=f"{variable} Trendline", linestyle='--')
# Add labels and title
plt.title('Scaled Values Change Over Time for Taylor Swift Popularity and Drug Overdoses', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Scaled Values (0-100)', fontsize=12)
plt.xticks(scaled_data["year"], rotation=45)([<matplotlib.axis.XTick object at 0x7fb1619b3790>, <matplotlib.axis.XTick object at 0x7fb1619b3670>, <matplotlib.axis.XTick object at 0x7fb16198ac40>, <matplotlib.axis.XTick object at 0x7fb161983760>, <matplotlib.axis.XTick object at 0x7fb161979760>, <matplotlib.axis.XTick object at 0x7fb161970a90>, <matplotlib.axis.XTick object at 0x7fb161983fa0>, <matplotlib.axis.XTick object at 0x7fb16198a7f0>, <matplotlib.axis.XTick object at 0x7fb1619b9d00>, <matplotlib.axis.XTick object at 0x7fb161997e80>, <matplotlib.axis.XTick object at 0x7fb1619a0970>, <matplotlib.axis.XTick object at 0x7fb16198a9d0>, <matplotlib.axis.XTick object at 0x7fb1619292b0>, <matplotlib.axis.XTick object at 0x7fb161929d60>, <matplotlib.axis.XTick object at 0x7fb16192e850>], [Text(2004, 0, '2004'), Text(2005, 0, '2005'), Text(2006, 0, '2006'), Text(2007, 0, '2007'), Text(2008, 0, '2008'), Text(2009, 0, '2009'), Text(2010, 0, '2010'), Text(2011, 0, '2011'), Text(2012, 0, '2012'), Text(2013, 0, '2013'), Text(2014, 0, '2014'), Text(2015, 0, '2015'), Text(2016, 0, '2016'), Text(2017, 0, '2017'), Text(2019, 0, '2019')])
plt.axhline(0, color='gray', linestyle='--', linewidth=1) # Add a baseline at y=0
# Add legend and grid
plt.legend(title='Metrics')
plt.grid(True)
# Show plot
plt.tight_layout()
plt.show()

Note: More analysis can be found under each code tab set; here we just show a summary.
Because our data sets have very different ranges, we must scale them to show relative change over time rather than plotting the raw values. For example, the violent crime rate is on the order of hundreds (per 100,000 population) while annual drug overdose deaths are in the tens of thousands. If the data is not scaled, the temporal relationship between violent crime, drug use, and Taylor Swift’s popularity is obscured, exactly as Taylor Swift wants. As data scientists with a firm grasp of mathematics, we are not so easily fooled.
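The rescaling described here is ordinary min-max normalization: each series is shifted and stretched so its smallest value maps to 0 and its largest to 100. A minimal sketch with made-up values of different magnitudes:

```python
import pandas as pd

def scale_0_100(x):
    """Min-max normalize a Series so its minimum maps to 0 and its maximum to 100."""
    return (x - x.min()) / (x.max() - x.min()) * 100

# Made-up values spanning very different magnitudes
crime = pd.Series([500.0, 450.0, 400.0])
deaths = pd.Series([20000.0, 60000.0, 100000.0])
print(scale_0_100(crime).tolist())   # [100.0, 50.0, 0.0]
print(scale_0_100(deaths).tolist())  # [0.0, 50.0, 100.0]
```

After scaling, both series live on the same 0-100 axis, so their shapes over time can be compared directly even though their raw units differ.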
#Scale all data to range from 0-100 for graphical visualization of relative trends per dataset
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(tidyr)
# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
mutate(across(-year, scale_0_100))
# Step 2: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data %>%
pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 3: Create the plot using ggplot2
ggplot(scaled_data_long, aes(x = year, y = value, color = variable)) +
geom_line(size = 1) + # Line plot
geom_point(size = 3) + # Add points for each data point
labs(title = "Scaled Values of Violent Crime, Popularity, and Drug Overdoses Over Time",
x = "Year", y = "Scaled Values") +
theme_minimal() + # Clean theme
scale_color_manual(values = c("red", "pink", "black")) + # Custom colors
theme(legend.title = element_blank()) + # Remove legend title
theme(panel.grid.major = element_line(color = "gray", size = 0.5)) # Add gridlines

Figure(1) Line plot showing Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity from 2004-2019. Sources: FBI, CDC, Google Trends “Taylor Swift”
#Same comparison but visualized with grouped bar chart
# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
mutate(across(-year, scale_0_100))
# Step 2: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data %>%
pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 3: Create the grouped bar plot using ggplot2
ggplot(scaled_data_long, aes(x = factor(year), y = value, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.8) +
labs(title = "Scaled Values of Violent Crime, Popularity, and Drug Overdoses by Year",
x = "Year",
y = "Scaled Values") +
scale_fill_manual(values = c("FBI Violent Crime" = "red",
"Taylor Swift Popularity" = "pink",
"Drug Overdoses" = "black"),
name = "Metric") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom") +
scale_y_continuous(limits = c(0, 100)) +
geom_text(aes(label = round(value, 1)),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 3)

Figure(2) Grouped Bar Chart showing Drug Overdoses, Violent Crime, and Taylor Swift’s Popularity from 2004-2019. Sources: FBI, CDC, Google Trends “Taylor Swift”
# Step 1: Scale all columns except 'year' from 0 to 100
scale_0_100 <- function(x) {
return((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100)
}
# Apply scaling to all columns except 'year'
scaled_data <- TaylorSwift_ViolentCrime_DrugOverdose %>%
mutate(across(-year, scale_0_100))
# Step 2: Select only the columns for 'Taylor Swift Popularity' and 'Drug Overdoses'
scaled_data_subset <- scaled_data %>%
select(year, `Taylor Swift Popularity`, `Drug Overdoses`)
# Step 3: Reshape data for plotting (from wide to long format)
scaled_data_long <- scaled_data_subset %>%
pivot_longer(cols = -year, names_to = "variable", values_to = "value")
# Step 4: Create the plot using ggplot2
ggplot(scaled_data_long, aes(x = factor(year), y = value, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.8) +
labs(title = "Scaled Values of Taylor Swift Popularity and Drug Overdoses by Year",
x = "Year",
y = "Scaled Values") +
scale_fill_manual(values = c("Taylor Swift Popularity" = "pink",
"Drug Overdoses" = "black"),
name = "Metric") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom") +
scale_y_continuous(limits = c(0, 100)) +
geom_text(aes(label = round(value, 1)),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 3)

Figure(3) Grouped Bar Chart showing Drug Overdoses and Taylor Swift’s Popularity from 2004-2019. Sources: CDC, Google Trends “Taylor Swift”
Here we see a clear year-by-year correlation between Taylor Swift’s rise in popularity and drug overdoses in the United States of America. It’s horrific that Ms. Swift’s music is causing these deaths, but such is life.
#Show correlation between drug overdose and popularity with a scatterplot/trendline
# Create the plot
ggplot(scaled_data, aes(x = year)) +
# Scatter plot for Taylor Swift Popularity
geom_point(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 3) +
# Scatter plot for Drug Overdoses
geom_point(aes(y = `Drug Overdoses`, color = "Drug Overdoses"), size = 3, shape = 15) +
# Fitted line for Taylor Swift Popularity
geom_smooth(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"),
method = "lm", se = FALSE, linetype = "dashed") +
# Fitted line for Drug Overdoses
geom_smooth(aes(y = `Drug Overdoses`, color = "Drug Overdoses"),
method = "lm", se = FALSE, linetype = "dashed") +
# Customize colors
scale_color_manual(values = c("Taylor Swift Popularity" = "pink", "Drug Overdoses" = "black")) +
# Labels and title
labs(title = "Scaled Values of Taylor Swift Popularity and Drug Overdoses Over Time",
x = "Year",
y = "Scaled Value (0 to 100)",
color = "") +
# Customize the theme
theme_minimal() +
theme(legend.position = "top",
panel.grid.major = element_line(color = "gray", linetype = "dotted"),
axis.text.x = element_text(angle = 0, hjust = 0.5)) +
# Set y-axis limits
scale_y_continuous(limits = c(0, 100)) +
# Set x-axis to show all years
scale_x_continuous(breaks = scaled_data$year)

`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_smooth()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_smooth()`).
Figure(4) Scatterplot with trendlines showing Drug Overdoses and Taylor Swift’s Popularity from 2004-2019. Sources: CDC, Google Trends “Taylor Swift”
We again see a clear year-by-year correlation between Taylor Swift’s rise in popularity and drug overdoses in the United States of America. It’s horrific that Ms. Swift’s music is causing these deaths, but such is life.
# Create the plot
ggplot(scaled_data, aes(x = year)) +
# Line and points for Taylor Swift Popularity
geom_line(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 1) +
geom_point(aes(y = `Taylor Swift Popularity`, color = "Taylor Swift Popularity"), size = 3) +
# Line and points for Violent Crime Rate
geom_line(aes(y = `FBI Violent Crime`, color = "FBI Violent Crime"), size = 1) +
geom_point(aes(y = `FBI Violent Crime`, color = "FBI Violent Crime"), size = 3, shape = 15) +
# Customize colors
scale_color_manual(values = c("Taylor Swift Popularity" = "pink", "FBI Violent Crime" = "red")) +
# Labels and title
labs(title = "Scaled Values of Taylor Swift Popularity and FBI Violent Crime Over Time",
x = "Year",
y = "Scaled Value (0 to 100)",
color = "") +
# Customize the theme
theme_minimal() +
theme(legend.position = "top",
panel.grid = element_blank(), # Remove all grid lines
axis.text.x = element_text(angle = 0, hjust = 0.5)) +
# Set y-axis limits
scale_y_continuous(limits = c(0, 100)) +
# Set x-axis to show all years
scale_x_continuous(breaks = scaled_data$year)

Figure(5) Line plot showing FBI Violent Crime and Taylor Swift’s Popularity from 2004-2019. Sources: FBI, Google Trends “Taylor Swift”
Shockingly, Ms. Swift’s music seems to have a calming effect on American citizens. Perhaps Taylor is sending Americans into a drug-ridden stupor.
Altogether, this work provides unequivocal evidence of Taylor Swift’s corruption of America. Ms. Swift’s music lulls the masses into a drug-ridden complacency.
Future work needs to investigate the negative correlation between Ms. Swift’s music and Violent Crime Rate in America. One plausible explanation is that Ms. Swift’s music triggers downregulation of testosterone production in males, leading to reduced violent crime.
Most of the wrangling challenges were introduced by raw data formatting issues. For example, footnote markers in the FBI crime data were fused onto values, which required a mutation to strip the extra digit. Additionally, the FBI data had invisible characters in its column names, which made it confusing to select or extract specific columns. The Google Trends data required separating wider, grouping by year, and aggregating to yield usable data. The CDC overdose data needed only specific columns extracted.
It was also challenging to join the data sources into a single dataframe, as some columns were numeric while others were character. This required converting characters to numeric for subsequent calculation during analysis. SQL was particularly difficult at first, as I was unfamiliar with how to set up the database connection.
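These cleanup steps can be sketched in Python (one of this project’s languages) with pandas. The sample values below are hypothetical stand-ins that mimic the quirks described above: a footnote digit fused onto a year, and a non-breaking space hidden inside a column name.

```python
import pandas as pd

# Hypothetical sample mimicking the raw FBI file's quirks
raw = pd.DataFrame({
    "Year": ["2004", "20051"],  # "20051" is 2005 plus the footnote marker "1"
    "Violent\u00a0Crime Rate": ["463.2", "469.0"],  # invisible non-breaking space
})

# Normalize invisible characters in column names so columns can be selected
raw.columns = [c.replace("\u00a0", " ") for c in raw.columns]

# Drop the trailing footnote digit by keeping only the 4-character year
raw["Year"] = raw["Year"].str[:4].astype(int)
```

The same idea in R would use `mutate()` with a string-trimming function; the point is that the footnote repair happens once, at ingest, before any joins or analysis.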
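A minimal Python sketch of the character-to-numeric conversion before the join, again using hypothetical stand-in frames for the cleaned sources:

```python
import pandas as pd

# Stand-ins for the cleaned sources: crime rates arrived as character
# strings, while the popularity scores were already numeric
crime = pd.DataFrame({"year": [2004, 2005], "violent_crime_rate": ["463.2", "469.0"]})
swift = pd.DataFrame({"year": [2004, 2005], "popularity": [18, 15]})

# Convert character columns to numeric first, so arithmetic on the
# merged frame (scaling, correlations) works downstream
crime["violent_crime_rate"] = pd.to_numeric(crime["violent_crime_rate"])

# Join the sources into a single dataframe on the shared year key
merged = crime.merge(swift, on="year")
```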
Compiling the Python, R, and SQL code into one HTML document was challenging and required quite a bit of formatting to get to a point where I was happy with it.
It was also challenging to visualize trends in the data during analysis due to the multiple-order-of-magnitude difference in the scales of our data. For example, the violent crime rate takes values on the order of hundreds, while overdose deaths per year run into the hundreds of thousands. This was fixed by applying a scaling function so that all series range from 0 to 100, making year-over-year changes comparable.
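The scaling function is a simple min-max rescale; here is a sketch in Python with illustrative magnitudes only (not the actual CDC or FBI figures):

```python
def scale_0_100(values):
    """Min-max scale a sequence so its minimum maps to 0 and its maximum to 100."""
    lo, hi = min(values), max(values)
    return [100 * (v - lo) / (hi - lo) for v in values]

# Illustrative magnitudes: raw overdose counts dwarf crime rates,
# but both series land on the same 0-100 axis once scaled
overdoses = [23_500, 38_300, 70_600, 107_900]
crime_rate = [463.2, 469.0, 479.3, 380.7]

scaled_overdoses = scale_0_100(overdoses)
scaled_crime = scale_0_100(crime_rate)
```

After scaling, both series can share one y-axis (as in the plots above) without the smaller series flattening into an invisible line at the bottom.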
The data analysis pipeline was successfully implemented in R, SQL, Python, and Excel. In addition, the project was compiled into one master HTML document presenting all the work in the various languages (Excel attached separately). Before this project I had never coded in Python past “Hello World”, had never used R, had never even heard of SQL, and had certainly never generated an HTML document. I now feel confident using all of these tools. This project also demonstrates a grasp of statistical analysis and how to analyze data. The inherent disconnection between the datasets in this project made the analysis a creative process.
I also feel comfortable using terms like extract, separate wider/longer, aggregate, and group by when discussing data wrangling in both professional and everyday conversation.
For particularly challenging problems, such as the initial generation of the year variable for the Taylor Swift Google Trends data, data scaling, and graphical analysis, AI (ChatGPT) was used to assist in writing functions. This was empowering, as AI can be a powerful collaborator.